91 research outputs found
Bootstrapping Information Extraction from Field Books
We present two machine learning approaches to information extraction from semi-structured documents that can be used if no annotated training data are available, but there does exist a database filled with information derived from the type of documents to be processed. One approach employs standard supervised learning for information extraction by artificially constructing labelled training data from the contents of the database. The second approach combines unsupervised Hidden Markov modelling with language models. Empirical evaluation of both systems suggests that it is possible to bootstrap a field segmenter from a database alone. The combination of Hidden Markov and language modelling was found to perform best at this task.
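The first approach, constructing labelled training data from database contents, could be sketched as follows. The field names, the record, and the token-matching heuristic are all invented for illustration; the paper does not specify this exact procedure.

```python
# Sketch: artificially labelling tokens in a raw field-book line by
# matching them against values stored in a database record, so that a
# supervised field segmenter can be trained without hand annotation.
# Field names and the matching strategy are illustrative only.

def label_tokens(line, record):
    """Assign each token the database field whose value contains it,
    falling back to 'O' (outside) for unmatched tokens."""
    labels = []
    for token in line.split():
        label = "O"
        for field, value in record.items():
            if token.lower() in value.lower().split():
                label = field
                break
        labels.append((token, label))
    return labels

# Hypothetical database record and field-book line.
record = {"collector": "J. Smith", "species": "Quercus robur", "date": "12 May 1932"}
line = "Quercus robur collected by J. Smith on 12 May 1932"
print(label_tokens(line, record))
```

Labelled sequences produced this way could then feed any standard sequence tagger; noise from imperfect matches is the price of skipping manual annotation.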
Discovering Lexical Generalisations. A Supervised Machine Learning Approach to Inheritance Hierarchy Construction
Institute for Communicating and Collaborative Systems

Grammar development over the last decades has seen a shift away from large inventories of grammar rules to richer lexical structures. Many modern grammar theories are highly lexicalised. But simply listing lexical entries typically results in an undesirable amount of redundancy. Lexical inheritance hierarchies, on the other hand, make it possible to capture linguistic generalisations and thereby reduce redundancy.

Inheritance hierarchies are usually constructed by hand, but this is time-consuming and often impractical if a lexicon is very large. Constructing hierarchies automatically or semi-automatically facilitates a more systematic analysis of the lexical data. In addition, lexical data is often extracted automatically from corpora, and this is likely to increase over the coming years. Therefore it makes sense to go a step further and automate the hierarchical organisation of lexical data too.

Previous approaches to automatic lexical inheritance hierarchy construction tended to focus on minimality criteria, aiming for hierarchies that minimised one or more criteria such as the number of path-value pairs, the number of nodes or the number of inheritance links (Petersen 2001, Barg 1996a, and in a slightly different context: Light 1994). Aiming for minimality is motivated by the fact that the conciseness of inheritance hierarchies is a main reason for their use. However, I will argue that there are several problems with minimality-based approaches. First, minimality is not well defined in the context of lexical inheritance hierarchies, as there is a tension between different minimality criteria. Second, minimality-based approaches tend to underestimate the importance of linguistic plausibility. While such approaches start with a definition of minimal redundancy and then try to prove that this leads to plausible hierarchies, the approach suggested here takes the opposite direction. It starts with a manually built hierarchy to which a supervised machine learning algorithm is applied with the aim of finding a set of formal criteria that can guide the construction of plausible hierarchies. Taking this direction makes it more likely that the selected criteria do in fact lead to plausible hierarchies. Using a machine learning technique also has the advantage that the set of criteria can be much larger than in hand-crafted definitions. Consequently, one can define conciseness in very broad terms, taking into account interdependencies in the data as well as simple minimality criteria. This leads to a more fine-grained model of hierarchy quality.

In practice, the method proposed here consists of two components: Galois lattices are used to define the search space as the set of all generalisations over the input lexicon. Maximum entropy models which have been trained on a manually built hierarchy are then applied to the lattice of the input lexicon to distinguish between plausible and implausible generalisations based on the formal criteria that were found in the training step. An inheritance hierarchy is then derived by pruning implausible generalisations. The hierarchy is automatically evaluated by matching it to a manually built hierarchy for the input lexicon.

Automatically constructing lexical hierarchies is a hard task, partly because what is considered the best hierarchy for a lexicon is to some extent subjective. Supervised learning methods also suffer from a lack of suitable training data. Hence, a semi-automatic architecture may be best suited for the task. Therefore, the performance of the system has been tested using a semi-automatic as well as an automatic architecture, and it has also been compared to the performance achieved by the pruning algorithm suggested by Petersen (2001). The findings show that the method proposed here is well suited for semi-automatic hierarchy construction.
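The two components described above can be sketched on a toy lexicon: candidate generalisations are the intersections of entry subsets (the kind of shared-attribute nodes a Galois lattice makes explicit), and a scoring function prunes the implausible ones. The lexicon, attribute-value pairs, and plausibility heuristic below are all invented; the actual system uses a trained maximum entropy model in place of the heuristic.

```python
from itertools import combinations

# Toy lexicon: each entry is a set of (attribute, value) pairs.
lexicon = {
    "walk":   frozenset({("cat", "verb"), ("trans", "-"), ("reg", "+")}),
    "devour": frozenset({("cat", "verb"), ("trans", "+"), ("reg", "+")}),
    "sing":   frozenset({("cat", "verb"), ("trans", "-"), ("reg", "-")}),
}

def generalisations(entries):
    """All non-empty intersections of entry subsets: the candidate
    generalisation nodes a Galois lattice would contain."""
    nodes = set()
    items = list(entries)
    for r in range(2, len(items) + 1):
        for combo in combinations(items, r):
            shared = frozenset.intersection(*combo)
            if shared:
                nodes.add(shared)
    return nodes

def plausible(node):
    # Stand-in for the trained maximum entropy model: keep only
    # generalisations that capture at least two attribute-value pairs.
    return len(node) >= 2

candidates = generalisations(lexicon.values())
hierarchy_nodes = {n for n in candidates if plausible(n)}
```

Enumerating all subsets is exponential in the lexicon size, which is why the real system works over the lattice structure rather than brute force; the sketch only shows where the pruning step sits.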
So to Speak: A Computational and Empirical Investigation of Lexical Cohesion of Non-Literal and Literal Expressions in Text
Lexical cohesion is an important device for signaling text organization. In this paper, we investigate to what extent a particular class of expressions which can have a non-literal interpretation participates in the cohesive structure of a text. Specifically, we look at five expressions headed by a verb which, depending on the context, can have either a literal or a non-literal meaning: bounce off the wall ("to be excited and full of nervous energy"), get one's feet wet ("to start a new activity or job"), rock the boat ("to disturb the balance or routine of a situation"), break the ice ("to start to get to know people, to overcome initial shyness"), and play with fire ("to take part in a dangerous or risky undertaking"). We look at the problem both from an empirical and a computational perspective. The results from our empirical study suggest that both literal and non-literal expressions exhibit cohesion with their textual context, but that the latter appear to do so to a lesser extent. We also show that an automatically computable semantic relatedness measure based on search engine page counts correlates well with human intuitions about the cohesive structure of a text and can therefore be used to determine the cohesive structure of a text automatically with a reasonable degree of accuracy. This investigation is undertaken from the perspective of computational linguistics. We aim both to model this cohesion computationally and to support our approach to computational modeling with empirical data.
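One well-known relatedness measure built from search engine page counts is the Normalized Google Distance of Cilibrasi and Vitányi; whether it is the exact measure used in the paper is not stated here, and the hit counts below are invented stand-ins for real search-engine results.

```python
from math import log

def ngd(hits_x, hits_y, hits_xy, total_pages):
    """Normalized Google Distance: smaller means more related.
    hits_x / hits_y are page counts for each term alone,
    hits_xy the count for both terms together."""
    fx, fy, fxy, n = map(log, (hits_x, hits_y, hits_xy, total_pages))
    return (max(fx, fy) - fxy) / (n - min(fx, fy))

# Invented counts: a strongly co-occurring pair vs. a rarely
# co-occurring one, against a hypothetical index size.
d_related = ngd(hits_x=9_000_000, hits_y=8_000_000,
                hits_xy=4_000_000, total_pages=10_000_000_000)
d_unrelated = ngd(hits_x=9_000_000, hits_y=8_000_000,
                  hits_xy=10_000, total_pages=10_000_000_000)
```

A threshold on such a distance is one way to decide automatically whether an expression coheres with the surrounding lexical context.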
The (In-)Consistency of Literary Concepts. Operationalising, Annotating and Detecting Literary Comment
This paper explores how both annotation procedures and automatic detection (i.e. classifiers) can be used to assess the consistency of textual literary concepts. We developed an annotation tagset for the "literary comment" (a frequently used but rarely defined concept) and its subtypes (interpretative comment, attitude comment and metanarrative/metafictional comment) and trained a multi-output and a binary classifier. The multi-output classifier shows F-scores of 28% for attitude comment, 36% for interpretative comment and 48% for meta comment, whereas the binary classifier achieves F-scores up to 59%. Crucially, both our annotation and the automatic classification struggle with the same subtypes of comment, although annotation and classification follow completely different procedures. Our findings suggest an inconsistency in the overall literary concept "comment", most prominently in the subtypes "attitude comment" and "interpretative comment". As a best-practice example, our approach illustrates that the contribution of Digital Humanities to Literary Studies may go beyond the automatic recognition of literary phenomena.
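The F-scores quoted above are the standard harmonic mean of precision and recall; a minimal sketch, with invented confusion counts for one comment subtype:

```python
def f_score(tp, fp, fn):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented counts: 30 correct detections, 40 false alarms, 35 misses.
score = f_score(tp=30, fp=40, fn=35)
```

Because the harmonic mean punishes imbalance, a classifier cannot buy a high F-score by over-predicting a frequent label, which makes it a fair summary for skewed annotation categories like these.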
…